Feature Engineering¶

The main objectives for this notebook are:

  • Develop a set of features that have the potential to improve your model's performance
  • Investigate the relationships between your new features and your target

The skills that you need to showcase:

  • Your domain expertise
  • Your data wrangling skills

How to stand out?¶

  1. Engineer well-argued features (citing sources earns double bonus points)
  2. Validate your features after engineering
  3. Don't use blind (automated) feature engineering; it's a waste of time
  4. Design a feature engineering pipeline at the end of the notebook

Imports¶

In [ ]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.io as pio
import polars as pl
import seaborn as sns
from feature_engine.selection import SmartCorrelatedSelection

# Path needs to be added manually to read from another folder
path2add = os.path.normpath(
    os.path.abspath(os.path.join(os.path.dirname("__file__"), os.path.pardir, "utils"))
)
if not (path2add in sys.path):
    sys.path.append(path2add)

from feature_engineering import (
    aggregate_node_features,
    feature_predictive_power,
    get_graph_features,
)

pio.renderers.default = "notebook"
In [ ]:
data = pl.read_parquet('../data/supervised_clean_data.parquet')
calls = pl.read_json('../data/supervised_call_graphs.json')
In [ ]:
data.head(1)
Out[ ]:
shape: (1, 13)
_idinter_api_access_duration(sec)api_access_uniquenesssequence_length(count)vsession_duration(min)ip_typenum_sessionsnum_usersnum_unique_apissourceclassificationis_anomaly
i64strf64f64f64i64strf64f64f64strstrbool
0"1f2c32d8-2d6e-…0.0008120.00406685.6432435405"default"1460.01295.0451.0"E""normal"false
In [ ]:
calls.head(1)
Out[ ]:
shape: (1, 2)
_idcall_graph
strlist[struct[2]]
"1f2c32d8-2d6e-…[{"1f873432-6944-3df9-8300-8a3cf9f95b35","5862055b-35a6-316a-8e20-3ae20c1763c2"}, {"8955faa9-0e33-37ad-a1dc-f0e640a114c2","a4fd6415-1fd4-303e-aa33-bb1830b5d9d4"}, … {"016099ea-6f20-3fec-94cf-f7afa239f398","6fa8ad53-2f0d-3f44-8863-139092bfeda9"}]

Since the main dataset already contains engineered features, there is little opportunity for further feature engineering there. Instead, additional features will be created from the graph data in supervised_call_graphs.json

Process Graph Data¶

In [ ]:
calls_processed = (
    calls.with_columns(
        pl.col("call_graph").list.eval(
            pl.element().struct.rename_fields(["from", "to"])
        )
    )
    .explode("call_graph")
    .unnest("call_graph")
)

calls_processed.head()
Out[ ]:
shape: (5, 3)
_idfromto
strstrstr
"1f2c32d8-2d6e-…"1f873432-6944-…"5862055b-35a6-…
"1f2c32d8-2d6e-…"8955faa9-0e33-…"a4fd6415-1fd4-…
"1f2c32d8-2d6e-…"85754db8-6a55-…"85754db8-6a55-…
"1f2c32d8-2d6e-…"9f08fee1-953c-…"876b4958-7df1-…
"1f2c32d8-2d6e-…"857c4b20-3057-…"857c4b20-3057-…

Feature Engineering¶

We can see that each graph has a separate _id that can later be used to join to the main dataset. A graph consists of source and destination nodes, which refer to the available API calls.

Basic Graph Level Features¶

The most basic graph-level features that we can engineer are:

  • Number of edges (connections)
  • Number of nodes (APIs)

These features can be useful since most behaviours will have a "normal" range of APIs that they contact. If this number is too large or too small, it might indicate anomalous activity.
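As a minimal illustration before the polars implementation, both counts can be computed directly from a plain list of (from, to) pairs (the edge list below is hypothetical):

```python
# Hypothetical edge list: each tuple is one (from, to) API call.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")]

n_connections = len(edges)                           # number of edges
n_unique_nodes = len({n for e in edges for n in e})  # number of distinct APIs
```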

In [ ]:
graph_features = calls_processed.group_by('_id').agg(
    pl.len().alias('n_connections'),
    pl.col('from'),
    pl.col('to')
).with_columns(
    pl.concat_list('from', 'to').list.unique().list.len().alias('n_unique_nodes')
).select([
    '_id',
    'n_connections',
    'n_unique_nodes'
])

graph_features.sample(3)
Out[ ]:
shape: (3, 3)
_idn_connectionsn_unique_nodes
stru32u32
"79c18974-2983-…6831
"ab6f299d-be1c-…1210
"5e8cc48d-d2bc-…1710

Node Level Features¶

Since graphs consist out of nodes, we can engineer a set of features around specific nodes (APIs). We can calculate:

  • Node degrees - the number of edges that come from/into a node. Very highly connected nodes can look anomalous.
  • Node centrality - there are various centrality measures (e.g. PageRank), but they all try to estimate how important a specific node is to the whole graph. This feature could be useful because a behaviour pattern that doesn't touch any of the "central" APIs would look anomalous.

These features can be broken down into:

  • global features - measure node attributes across all the graphs
  • local features - measure node attributes across a specific graph
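The degree features below are computed directly in polars, but the same quantities, plus centrality measures like PageRank (which this notebook does not compute), can be sketched on a toy graph with networkx, assuming that library is available:

```python
import networkx as nx

# Toy directed call graph (hypothetical node names).
G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])

out_degrees = dict(G.out_degree())  # edges leaving each node
in_degrees = dict(G.in_degree())    # edges entering each node
pagerank = nx.pagerank(G)           # centrality: relative importance of each node
```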
In [ ]:
calls_processed = calls_processed.with_columns(
    global_source_degrees = pl.len().over(pl.col('from')),
    global_dest_degrees = pl.len().over(pl.col('to')),
    local_source_degrees = pl.len().over(pl.col('from'), pl.col('_id')),
    local_dest_degrees = pl.len().over(pl.col('to'), pl.col('_id'))
)

calls_processed.sample(3)
Out[ ]:
shape: (3, 7)
_idfromtoglobal_source_degreesglobal_dest_degreeslocal_source_degreeslocal_dest_degrees
strstrstru32u32u32u32
"290fe43b-8b93-…"756ab2fe-a386-…"27c07c16-5720-…68081222323
"ea0d02f5-ef61-…"90a655af-9f52-…"a449d369-17b1-…99822013118
"f06fbc92-2a0e-…"43dcab78-0f41-…"1d768e1f-ee4c-…288510353716

Now that the node-level features are calculated, we need to aggregate them per graph (_id). When aggregating, we can calculate average, std, min, and max statistics for every feature to capture its distribution well.

In [ ]:
node_features_agg = aggregate_node_features(
    calls_processed,
    node_features=[
        "global_source_degrees",
        "global_dest_degrees",
        "local_source_degrees",
        "local_dest_degrees",
    ],
    by="_id",
)

graph_features = graph_features.join(node_features_agg, on="_id")
In [ ]:
graph_features.head()
Out[ ]:
shape: (5, 19)
_idn_connectionsn_unique_nodesavg_global_source_degreesmin_global_source_degreesmax_global_source_degreesstd_global_source_degreesavg_global_dest_degreesmin_global_dest_degreesmax_global_dest_degreesstd_global_dest_degreesavg_local_source_degreesmin_local_source_degreesmax_local_source_degreesstd_local_source_degreesavg_local_dest_degreesmin_local_dest_degreesmax_local_dest_degreesstd_local_dest_degrees
stru32u32f64u32u32f64f64u32u32f64f64u32u32f64f64u32u32f64
"8c603b36-76b3-…60275843.28333363320715330.2610467802.75273224166417.055385.6666671155.5865054.11113.462633
"538884b8-1a36-…11104150.45454555320719742.3376294079.27272748220138500.8300661.363636120.5045251.545455130.934199
"460ab8c1-b9ec-…6271794908.6140358320718230.1320135324.53118224167059.30017311.06060614812.57842511.4593313811.202301
"e6ab8dbb-bad2-…112458288.33035779320719626.03616310587.27678665224169107.7843025.1251144.2002257.6607141196.795008
"ea024be4-a2cb-…6233241.0645161596201.201951407.91935511151430.8000683.354839182.3891614.5483871113.67837

Feature Selection¶

Feature selection will be done using 2 steps:

  1. Quality checks - if a feature is constant or has too many missing values (>= 95%), it will be dropped
  2. Correlation analysis - if features are very highly correlated with each other (absolute correlation >= 0.95), all but one of them can be dropped as well
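The correlation step is implemented further below with feature_engine's SmartCorrelatedSelection, which keeps the highest-variance member of each correlated group. A simpler hand-rolled variant (keeping the first-seen feature of each pair instead) could be sketched in pandas as:

```python
import numpy as np
import pandas as pd

def correlated_to_drop(df: pd.DataFrame, threshold: float = 0.95) -> list[str]:
    # Absolute correlations in the strict upper triangle only, so each
    # pair is inspected once and no feature is compared with itself.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop any feature that correlates above the threshold
    # with an earlier (kept) feature.
    return [c for c in upper.columns if (upper[c] >= threshold).any()]
```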
In [ ]:
engineered_features = graph_features.columns[1:]
engineered_features
Out[ ]:
['n_connections',
 'n_unique_nodes',
 'avg_global_source_degrees',
 'min_global_source_degrees',
 'max_global_source_degrees',
 'std_global_source_degrees',
 'avg_global_dest_degrees',
 'min_global_dest_degrees',
 'max_global_dest_degrees',
 'std_global_dest_degrees',
 'avg_local_source_degrees',
 'min_local_source_degrees',
 'max_local_source_degrees',
 'std_local_source_degrees',
 'avg_local_dest_degrees',
 'min_local_dest_degrees',
 'max_local_dest_degrees',
 'std_local_dest_degrees']

Quality Checks¶

In [ ]:
null_counts = graph_features.null_count().transpose(include_header=True, header_name='col', column_names=['null_count'])
null_counts.filter(pl.col('null_count') > 0)
Out[ ]:
shape: (4, 2)
colnull_count
stru32
"std_global_sou…42
"std_global_des…42
"std_local_sour…42
"std_local_dest…42
In [ ]:
static_features = graph_features.select(engineered_features).std().transpose(include_header=True, header_name='col', column_names=['std'])
static_features.filter(pl.col('std') == 0)
Out[ ]:
shape: (0, 2)
colstd
strf64

Observations:

  • Only the four std features have missing values (42 each), likely from graphs with a single edge, where the standard deviation is undefined; this is far below the 95% threshold
  • No features are constant

Impact

  • No features will be dropped for quality reasons

Correlation Analysis¶

Let's plot the correlation matrix of the engineered features to check for highly correlated groups.

In [ ]:
feature_corrs = graph_features.select(engineered_features).corr().to_pandas()
feature_corrs.index = feature_corrs.columns
matrix = np.triu(feature_corrs)
fig = plt.figure(figsize=(20, 10))
sns.heatmap(feature_corrs, annot=True, mask=matrix)
Out[ ]:
<Axes: >
No description has been provided for this image

We can see clear groups of highly correlated features. Hence, let's apply SmartCorrelatedSelection to reduce the set of engineered features.

In [ ]:
features_pd = graph_features.select(engineered_features).to_pandas().dropna()

tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.95,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

tr.fit(features_pd)

print('Features to drop:')
for f in tr.features_to_drop_:
    print(f)
Features to drop:
n_unique_nodes
std_global_dest_degrees
avg_local_source_degrees
max_local_source_degrees
avg_local_dest_degrees
max_local_dest_degrees
std_local_dest_degrees

Observations:

  • Engineered features have groups of high correlation

Impact

  • ['n_unique_nodes', 'std_global_dest_degrees', 'avg_local_source_degrees', 'max_local_source_degrees', 'avg_local_dest_degrees', 'max_local_dest_degrees', 'std_local_dest_degrees'] are dropped because each belongs to a highly correlated group and has lower variance than the feature retained from that group

EDA for Remaining Engineered Features¶

In [ ]:
remaining_engineered_features = list(set(features_pd.columns).difference(set(tr.features_to_drop_)))
graph_features = graph_features.join(data.select(['_id', 'is_anomaly']),  on='_id')
In [ ]:
scores = []
for f in remaining_engineered_features:
    print("Feature Analysis:", f)
    score = feature_predictive_power(graph_features, f, "is_anomaly")
    scores.append(score)
Feature Analysis: std_global_source_degrees
Predictive Power Score: 0.44369998574256897
/Users/antonsruberts/miniconda/envs/dev/lib/python3.10/site-packages/plotly/express/_core.py:1985: FutureWarning:

When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.

Feature Analysis: min_global_source_degrees
Predictive Power Score: 0.5494999885559082
Feature Analysis: max_global_dest_degrees
Predictive Power Score: 0.5921000242233276
Feature Analysis: avg_global_source_degrees
Predictive Power Score: 0.328900009393692
Feature Analysis: min_local_dest_degrees
Predictive Power Score: 0.007799999788403511
Feature Analysis: max_global_source_degrees
Predictive Power Score: 0.36739999055862427
Feature Analysis: avg_global_dest_degrees
Predictive Power Score: 0.3370000123977661
Feature Analysis: min_global_dest_degrees
Predictive Power Score: 0.5932000279426575
Feature Analysis: n_connections
Predictive Power Score: 0.5871999859809875
Feature Analysis: std_local_source_degrees
Predictive Power Score: 0.5327000021934509
Feature Analysis: min_local_source_degrees
Predictive Power Score: 0.0
In [ ]:
pd.Series(scores, index=remaining_engineered_features).sort_values(ascending=False)
Out[ ]:
min_global_dest_degrees      0.5932
max_global_dest_degrees      0.5921
n_connections                0.5872
min_global_source_degrees    0.5495
std_local_source_degrees     0.5327
std_global_source_degrees    0.4437
max_global_source_degrees    0.3674
avg_global_dest_degrees      0.3370
avg_global_source_degrees    0.3289
min_local_dest_degrees       0.0078
min_local_source_degrees     0.0000
dtype: float32

Observations:

  • Most of the engineered features have relatively high predictive power scores
  • The most predictive features are global
  • The features with no predictive power measure minimum degrees of local graphs
  • Relationships between the engineered features and the target are non-linear

Impact

  • min_local_dest_degrees and min_local_source_degrees can be dropped
  • Tree-based models should be used to capture these non-linear relationships
In [ ]:
remaining_engineered_features = [f for f in remaining_engineered_features if f not in ['min_local_dest_degrees', 'min_local_source_degrees']]
print('Final engineered featureset:')
print(remaining_engineered_features)
Final engineered featureset:
['std_global_source_degrees', 'min_global_source_degrees', 'max_global_dest_degrees', 'avg_global_source_degrees', 'max_global_source_degrees', 'avg_global_dest_degrees', 'min_global_dest_degrees', 'n_connections', 'std_local_source_degrees']

Feature Engineering Pipeline¶

In [ ]:
selected_features = [
    "max_global_source_degrees",
    "avg_global_source_degrees",
    "min_global_dest_degrees",
    "std_local_source_degrees",
    "max_global_dest_degrees",
    "min_global_source_degrees",
    "std_global_source_degrees",
    "n_connections",
    "avg_global_dest_degrees",
]

calls = (
    (
        pl.read_json("../data/supervised_call_graphs.json")
        .with_columns(
            pl.col("call_graph").list.eval(
                pl.element().struct.rename_fields(["from", "to"])
            )
        )
        .explode("call_graph")
        .unnest("call_graph")
    )
    .with_columns(
        global_source_degrees=pl.len().over(pl.col("from")),
        global_dest_degrees=pl.len().over(pl.col("to")),
        local_source_degrees=pl.len().over(pl.col("from"), pl.col("_id")),
        local_dest_degrees=pl.len().over(pl.col("to"), pl.col("_id")),
    )
    .pipe(get_graph_features)
    .select(["_id"] + selected_features)
)

pl.read_parquet("../data/supervised_clean_data.parquet").join(
    calls, on="_id"
).write_parquet("../data/supervised_clean_data_w_features.parquet")

Summary¶

Feature Engineering Summary¶

  • 18 new features were engineered, capturing graph-level and node-level characteristics
  • Graph-level features measure the total size of the graphs
  • Node-level features measure node degrees at the global and local levels
  • 7 features were dropped due to high correlation within their groups
  • 2 more features were dropped due to low predictive power scores

Implications for ML¶

  • The 9 engineered and selected features are theorised to be useful for the prediction task, so they should be included in the final model
  • A feature engineering pipeline was designed so that new data can be transformed easily